Dataset contains 74 unique Assignment Groups

EXPLORATORY DATA ANALYSIS

Basic Data Cleaning

There are null values in the columns 'Short description' and 'Description'

Remove null values by converting them to strings
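A minimal sketch of this conversion, written in plain Python rather than taken from the notebook. The `to_text` helper and the sample rows are hypothetical; only the two column names come from the dataset. It maps missing values to an empty string, on the assumption that a literal "nan" or "None" token should not leak into the text.

```python
import math

def to_text(value):
    """Return value as a string, mapping None and NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

# Hypothetical sample rows; real rows would come from the ticket dataset.
rows = [
    {"Short description": "login issue", "Description": None},
    {"Short description": float("nan"), "Description": "unable to connect to vpn"},
]

# Apply the conversion to every cell so later text steps never see a null.
cleaned = [{col: to_text(val) for col, val in row.items()} for row in rows]
```

Calling `str()` directly on a missing value would instead produce the words "None" or "nan", which would later be counted as ordinary tokens.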

The converted mojibake data seems to be Mandarin. Thus other non-English languages are present and need to be converted.
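Mojibake of this kind usually arises when UTF-8 bytes are decoded with a single-byte codec. The sketch below reverses that, assuming (this is not stated in the notebook) that the wrong codec was cp1252. The `fix_mojibake` helper is a hypothetical name.

```python
def fix_mojibake(text, wrong="cp1252", right="utf-8"):
    """Undo a wrong decode by re-encoding with `wrong` and decoding with `right`."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not mojibake of this kind, so leave it unchanged.
        return text

# Simulate the damage: UTF-8 bytes of a Mandarin greeting read as cp1252.
garbled = "你好".encode("utf-8").decode("cp1252")
repaired = fix_mojibake(garbled)
```

Text that was never garbled, such as plain English, passes through unchanged because the round trip either reproduces it or fails and falls back to the input.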

Language detection

Observation

Data Cleaning

Most Common words before data cleaning and removing stop words

Stop words and other words that add no meaning to the content need to be removed.

Remove stop words
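Stop-word removal can be sketched as a simple token filter. The stop-word set below is a deliberately tiny, hypothetical list for illustration; the notebook's actual list is not shown here, and a real run would use a full list from an NLP library.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"the", "is", "to", "a", "an", "and", "in", "of", "please", "i", "my"}

def remove_stop_words(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = remove_stop_words("Please reset the password of my account")
```

Only the content-bearing words survive, which is what the "most common words" comparison before and after cleaning relies on.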

Lemmatization

Observations

Drop tickets with just one word in Description
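As a rough illustration, the filter can be written as a word count on the description. The sample tickets are hypothetical, and the sketch assumes that blank descriptions should be dropped along with one-word ones.

```python
# Hypothetical sample tickets; only the column name comes from the dataset.
tickets = [
    {"Description": "outlook"},
    {"Description": "unable to login to vpn"},
    {"Description": "   "},
    {"Description": "password reset"},
]

def word_count(text):
    """Count whitespace-separated words in a description."""
    return len(text.split())

# Keep only tickets whose description has at least two words.
kept = [t for t in tickets if word_count(t["Description"]) > 1]
```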

Observations

Tokenization

Visualize percentage of tickets per assignment group

Interactive plot

Pareto Chart
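The numbers behind a Pareto chart are the per-group counts sorted in descending order plus a running cumulative percentage. A minimal sketch with hypothetical group labels and counts:

```python
from collections import Counter

# Hypothetical labels and counts; the real data has 74 assignment groups.
groups = ["Group-0"] * 6 + ["Group-8"] * 3 + ["Group-12"] * 1

counts = Counter(groups).most_common()   # sorted by count, largest first
total = sum(n for _, n in counts)

pareto = []
running = 0
for group, n in counts:
    running += n
    # (group, ticket count, cumulative share of all tickets in percent)
    pareto.append((group, n, round(100 * running / total, 1)))
```

Plotting the counts as bars and the cumulative percentage as a line on a second axis gives the chart itself.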

Observations

Top 20 Assignment groups with highest number of tickets

Bottom 20 Assignment groups with lowest number of tickets

Distribution of tickets within groups that have fewer than 30 tickets

Observations

Drop groups that have just one ticket each
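One way to express this step, sketched with hypothetical data: count the tickets per group, then keep only the tickets whose group appears more than once.

```python
from collections import Counter

# Hypothetical (group, description) pairs.
tickets = [
    ("Group-0", "password reset"),
    ("Group-0", "account locked"),
    ("Group-8", "job scheduler failed"),
    ("Group-8", "server down"),
    ("Group-70", "single ticket in this group"),
]

counts = Counter(group for group, _ in tickets)

# Keep tickets only from groups that have at least two tickets.
kept = [(group, text) for group, text in tickets if counts[group] > 1]
remaining_groups = sorted({group for group, _ in kept})
```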

Word Cloud

Word clouds for the top 4 groups to understand each group's field of operation

Group 0

Observations

Observations

We find that the seemingly similar Group 0 and Group 8 do not overlap; there is a clear distinction in the type of field and the level of work

Comparison between Group-0(L1/L2) and Group-8(L3)

Observations - L1/L2 ticket counts are considerably higher than L3 ticket counts in the dataset

Group-12

Observations - Group-12 tickets mostly revolve around server, asa deny, dst outside, outside access.

Group-24

Observations - Group-24 consists of German-language tickets, which need translation

Caller Distribution Analysis

Word clouds for the top 4 callers

Observations

Observations

Data Pre-Processing

N-Grams

Uni-Gram

Bi-Grams

Tri-Grams
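Uni-grams, bi-grams and tri-grams are all the same sliding-window operation with a different window size. A minimal sketch, where `ngrams` is a hypothetical helper and the token list is made up:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["unable", "to", "connect", "to", "vpn"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Counting these strings across all tickets, for example with `collections.Counter`, gives the most frequent phrases for each window size.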

Feature Extraction

Calculating TF-IDF
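To make the calculation concrete, the sketch below computes the textbook variant by hand: term frequency is the term's share of the document, and inverse document frequency is log(N / df). This is an assumption about the formula; libraries such as scikit-learn use a smoothed variant by default, so their numbers will differ. The documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical tokenized tickets.
docs = [
    ["password", "reset"],
    ["password", "expired"],
    ["vpn", "connection", "failed"],
]
weights = tf_idf(docs)
```

In the first document, "reset" receives a higher weight than "password" because "password" also appears in another ticket and is therefore less distinctive.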

Observations

Perform LDA - Dimensionality Reduction

Observations

Transforming tokens to vectors using TF-IDF

Save the cleansed vectorized and non-vectorized datasets to CSV for future modelling

Label Encoding / One Hot Encoding

Label Encoding
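Label encoding maps each distinct assignment group to an integer. A minimal sketch with hypothetical labels, using sorted class order so the mapping is reproducible:

```python
# Hypothetical target labels.
labels = ["Group-8", "Group-0", "Group-12", "Group-0"]

# Sort the distinct classes so the same label always gets the same id.
classes = sorted(set(labels))
label_to_id = {label: i for i, label in enumerate(classes)}

encoded = [label_to_id[label] for label in labels]

# The class list doubles as the inverse mapping for decoding predictions.
decoded = [classes[i] for i in encoded]
```

Note that the order is lexicographic, so "Group-12" sorts before "Group-8". The integers are identifiers only and carry no ranking between groups.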

Modelling

List of Machine Learning models to try out:

Statistical ML Models

Neural Network Models

Creating models with the complete imbalanced dataset

Statistical ML Models

The number of unique targets in the train set differs from that in the test set because the data is highly imbalanced: a few groups have as few as 2 tickets before splitting.
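A quick way to see this mismatch is to compare the label sets of the two splits. The labels below are hypothetical and stand in for the result of an actual train/test split.

```python
# Hypothetical target labels after a split.
y_train = ["Group-0", "Group-0", "Group-8", "Group-12", "Group-12"]
y_test = ["Group-0", "Group-8", "Group-24"]

train_classes = set(y_train)
test_classes = set(y_test)

# Classes the model never sees during training cannot be predicted.
only_in_test = sorted(test_classes - train_classes)
# Classes absent from the test set are never evaluated.
only_in_train = sorted(train_classes - test_classes)
```

If either list is non-empty, per-class metrics are computed over different label sets in train and test, which is worth keeping in mind when reading the model scores.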

Multinomial Logistic Regression

Support Vector Machine

Stochastic Gradient Descent

Multinomial Naive Bayes

K-Nearest Neighbors

Random Forest

Decision Trees

XGBoost

Observations:

Deep Learning Models

Neural Network